feat: auto-vectorize bf16/fp16 reduce with packed add2 intrinsics #2112
kurisu6912 wants to merge 9 commits into tile-ai:main from
Conversation
…l for TVM_LOG_CUSTOMIZE builds When TVM_LOG_CUSTOMIZE=1, TVM's logging.cc skips compiling TvmLogDebugSettings::ParseSpec and VerboseEnabledImpl (guarded by #if TVM_LOG_CUSTOMIZE == 0). However libtilelang.so calls these functions via the inline FromFlag(), causing a runtime symbol lookup error. Add the missing implementations to tilelang's own logging.cc.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Make ReduceOpNode lane-size aware and add 2-lane packed reduction lowering for CUDA fp16/bf16 (thread-local and batched AllReduce), introduce packed TL functors and pack/shuffle helpers, update CUDA codegen for packed lanes, and add filename-scoped runtime VLOG parsing; tests expanded for bf16 reductions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Thread as Thread (warp/thread)
    participant PackBuf as Packed Buffer (uint1 x N)
    participant AllReduce as tl::AllReduce (external builtin)
    participant Clear as clear_buffer (scalar lanes)
    Thread->>PackBuf: allocate/init packed identity (vsize=2) or scalar fallback
    Thread->>PackBuf: thread-local accumulate (packed lanes)
    Thread->>AllReduce: pack/unpack & call run_batch on packed workspace
    AllReduce-->>Thread: packed reduced results
    Thread->>Clear: unpack/extract lanes -> write back to clear_buffer
```
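The flow above accumulates two 16-bit lanes that travel together in one 32-bit word. A minimal integer-lane sketch of that lane-wise add (`pack2`, `lane`, and `add2` here are illustrative names, not the TL intrinsics, and the lanes are plain integers rather than fp16/bf16):

```cpp
#include <cstdint>

// Pack two 16-bit lanes into one 32-bit word (lane 0 = low bits),
// mirroring how a bf16x2/fp16x2 pair travels as an opaque 32-bit value.
uint32_t pack2(uint16_t lo, uint16_t hi) {
    return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Extract lane 0 or 1 from a packed word.
uint16_t lane(uint32_t v, int i) {
    return static_cast<uint16_t>(v >> (i * 16));
}

// Lane-wise add of two packed pairs -- the integer analog of what a
// packed add2 intrinsic does for fp16x2/bf16x2 on the GPU: both lanes
// are reduced independently in a single packed value.
uint32_t add2(uint32_t a, uint32_t b) {
    return pack2(static_cast<uint16_t>(lane(a, 0) + lane(b, 0)),
                 static_cast<uint16_t>(lane(a, 1) + lane(b, 1)));
}
```

On hardware the per-lane add is a single instruction on the pair, which is why the lowering keeps the pair packed until the final unpack step.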
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 5
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_reduce.py (1)
78-79: Add a codegen assertion for the packed bf16 path. These new cases only prove numerical correctness. A scalar fallback would still pass, so the optimization this PR is about can regress unnoticed unless at least one fragment→fragment bf16 case also asserts that the generated source contains the packed reducer path (for example `tl::add2` / packed bf16 pair handling). Based on learnings, `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py` validates transforms by checking generated kernel source patterns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/language/test_tilelang_language_reduce.py` around lines 78 - 79, The new bf16 fragment→fragment test cases only check numerical correctness; add a codegen-level assertion in testing/python/language/test_tilelang_language_reduce.py that when a reduce with ("sum", T.bfloat16, ..., "fragment", "fragment", ...) is generated the kernel source contains the packed-bf16 reduction path (e.g. a pattern like "tl::add2" or the packed bf16 pair handling code) so the packed reducer is actually emitted; locate the fragment→fragment bf16 case and after generating the kernel source assert the expected packed-bf16 pattern is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/op/reduce.cc`:
- Around line 175-177: The packed reduction branch currently treats isSum() and
isAbsSum() the same, dropping the absolute value for abssum; change the logic in
src/op/reduce.cc so isAbsSum() is handled separately: when type->isAbsSum()
construct the reduction as add2(acc, abs(b)) (i.e., wrap the packed element `b`
with the absolute operation before adding) instead of using plain add2; update
the conditional (replace the combined if (type->isSum() || type->isAbsSum())
branch) so type->isSum() uses Call(..., tl::add2(), {acc, b}) and
type->isAbsSum() uses Call(..., tl::add2(), {acc, Call(..., tl::abs(), {b})}).
In `@src/runtime/logging.cc`:
- Line 113: The code uses settings.vlog_level_map_.emplace(name, level_val)
which silently keeps the first value on duplicate keys; change this to
explicitly handle duplicates by either overwriting with
settings.vlog_level_map_.insert_or_assign(name, level_val) (or operator[]) so
later entries take effect, or detect an existing key (find on vlog_level_map_)
and fail fast/log an error before inserting; update the surrounding log message
to reflect the actual effective level based on the chosen behavior (overwrite or
error).
- Around line 111-112: The unconditional LOG(INFO) used when parsing
TVM_LOG_DEBUG (the statement emitting "TVM_LOG_DEBUG enables VLOG statements in
'...'" in src/runtime/logging.cc) should not print to stderr for every spec
entry; replace it with a gated log such as VLOG(1) (or DLOG(INFO)) or remove it
entirely so the message only appears when verbose logging is enabled; update the
single LOG(INFO) call accordingly to use VLOG(1) or drop the emission.
- Around line 104-106: The code currently calls strtol(level.c_str(),
&end_of_level, 10) and narrows the result into int level_val without range
checking; change the logic in the parsing branch around end_of_level/level_val
to parse into a long (keep strtol), then check that end_of_level points to the
end of the string, errno != ERANGE, and that the returned long is between
INT_MIN and INT_MAX before static_cast<int>ing to level_val; if any check fails,
treat the input as malformed (reject/log error) instead of silently casting to
an incorrect VLOG level.
In `@src/target/codegen_cuda.cc`:
- Around line 4369-4389: The ExtractElement handling uses
PrintExpr(op->vectors[0]) directly and takes its address, which can produce an
invalid address-to-temporary; change to materialize the vector into an SSA
variable (use SSAGetID or the existing pattern used for packed vectors) and use
that SSA name instead of PrintExpr(...) when forming the reinterpret cast; also
add a lane bounds check (ensure lane is 0 or 1) before selecting "x"/"y" to
mirror the validation used elsewhere; update both the bfloat16 path
(enable_bf16_) and float16 path (enable_fp16_) branches to use the SSA variable
and validated lane.
---
Nitpick comments:
In `@testing/python/language/test_tilelang_language_reduce.py`:
- Around line 78-79: The new bf16 fragment→fragment test cases only check
numerical correctness; add a codegen-level assertion in
testing/python/language/test_tilelang_language_reduce.py that when a reduce with
("sum", T.bfloat16, ..., "fragment", "fragment", ...) is generated the kernel
source contains the packed-bf16 reduction path (e.g. a pattern like "tl::add2"
or the packed bf16 pair handling code) so the packed reducer is actually
emitted; locate the fragment→fragment bf16 case and after generating the kernel
source assert the expected packed-bf16 pattern is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 62642205-e2fb-4189-9a22-c1215d2e0405
📒 Files selected for processing (5)
- src/op/reduce.cc
- src/op/reduce.h
- src/runtime/logging.cc
- src/target/codegen_cuda.cc
- testing/python/language/test_tilelang_language_reduce.py
```cpp
char *end_of_level = nullptr;
int level_val = static_cast<int>(strtol(level.c_str(), &end_of_level, 10));
if (end_of_level != level.c_str() + level.size()) {
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n src/runtime/logging.cc | sed -n '90,120p'
```

Repository: tile-ai/tilelang
Length of output: 1541

🏁 Script executed:

```shell
rg -A 5 -B 5 "vlog_level_map_" src/runtime/logging.cc
```

Repository: tile-ai/tilelang
Length of output: 761

🏁 Script executed:

```shell
rg "VLOG\|vlog.*level" src/runtime/logging.cc | head -20
```

Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:

```shell
# Check the broader function to understand the scope
sed -n '60,116p' src/runtime/logging.cc | cat -n
```

Repository: tile-ai/tilelang
Length of output: 2293
Add range validation before narrowing strtol result to int.
On lines 105–106, strtol parses the level string as a long and directly casts to int without checking whether the value is in range. If a user provides a value outside [INT_MIN, INT_MAX], the cast silently wraps to an incorrect VLOG level instead of rejecting the input as malformed.
Suggested fix

```diff
+ errno = 0;
+ long parsed_level = std::strtol(level.c_str(), &end_of_level, 10);
+ if (errno == ERANGE || parsed_level < std::numeric_limits<int>::min() ||
+     parsed_level > std::numeric_limits<int>::max()) {
+   LOG(FATAL) << "TVM_LOG_DEBUG ill-formed at position " << tell_pos(level)
+              << ": level out of range: \"" << level << "\"";
+ }
- int level_val = static_cast<int>(strtol(level.c_str(), &end_of_level, 10));
+ int level_val = static_cast<int>(parsed_level);
```

🤖 Prompt for AI Agents
+ int level_val = static_cast<int>(parsed_level);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/runtime/logging.cc` around lines 104 - 106, The code currently calls
strtol(level.c_str(), &end_of_level, 10) and narrows the result into int
level_val without range checking; change the logic in the parsing branch around
end_of_level/level_val to parse into a long (keep strtol), then check that
end_of_level points to the end of the string, errno != ERANGE, and that the
returned long is between INT_MIN and INT_MAX before static_cast<int>ing to
level_val; if any check fails, treat the input as malformed (reject/log error)
instead of silently casting to an incorrect VLOG level.
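The checks the review asks for can be collected in one place. A minimal standalone sketch, where `parse_vlog_level` is a hypothetical helper name (the real fix edits the parsing branch in `src/runtime/logging.cc` in place):

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <optional>
#include <string>

// Parse a VLOG level with the three checks from the review:
//   1. the whole string must be consumed (no trailing junk),
//   2. strtol must not have overflowed (errno != ERANGE),
//   3. the long must fit in int before narrowing.
// Returns std::nullopt for malformed input instead of a wrapped level.
std::optional<int> parse_vlog_level(const std::string& level) {
    errno = 0;
    char* end = nullptr;
    long parsed = std::strtol(level.c_str(), &end, 10);
    if (level.empty() || end != level.c_str() + level.size()) {
        return std::nullopt;  // empty input or trailing junk
    }
    if (errno == ERANGE || parsed < INT_MIN || parsed > INT_MAX) {
        return std::nullopt;  // out of int range
    }
    return static_cast<int>(parsed);
}
```

The caller can then reject or `LOG(FATAL)` on `nullopt` rather than silently installing a wrapped level.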
```cpp
LOG(INFO) << "TVM_LOG_DEBUG enables VLOG statements in '" << name
          << "' up to level " << level;
```
Avoid unconditional INFO output while parsing TVM_LOG_DEBUG.
These lines print to stderr for every valid spec entry, even when the caller only wanted to configure VLOG gating. That makes the flag itself user-visible and can pollute tests or tooling that assert on stderr. Prefer VLOG/DLOG here, or drop the log entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/runtime/logging.cc` around lines 111 - 112, The unconditional LOG(INFO)
used when parsing TVM_LOG_DEBUG (the statement emitting "TVM_LOG_DEBUG enables
VLOG statements in '...'" in src/runtime/logging.cc) should not print to stderr
for every spec entry; replace it with a gated log such as VLOG(1) (or
DLOG(INFO)) or remove it entirely so the message only appears when verbose
logging is enabled; update the single LOG(INFO) call accordingly to use VLOG(1)
or drop the emission.
```cpp
}
LOG(INFO) << "TVM_LOG_DEBUG enables VLOG statements in '" << name
          << "' up to level " << level;
settings.vlog_level_map_.emplace(name, level_val);
```
Handle duplicate file entries explicitly.
Line 113 uses emplace, so repeated keys keep the first level silently. That makes duplicates behave incorrectly and also leaves the preceding log message lying about the effective level. Either overwrite (insert_or_assign / operator[]) or fail fast on duplicates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/runtime/logging.cc` at line 113, The code uses
settings.vlog_level_map_.emplace(name, level_val) which silently keeps the first
value on duplicate keys; change this to explicitly handle duplicates by either
overwriting with settings.vlog_level_map_.insert_or_assign(name, level_val) (or
operator[]) so later entries take effect, or detect an existing key (find on
vlog_level_map_) and fail fast/log an error before inserting; update the
surrounding log message to reflect the actual effective level based on the
chosen behavior (overwrite or error).
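The keep-first vs last-wins difference the comment describes is easy to see in isolation. A small sketch (`effective_level` is a hypothetical demo helper, not code from the PR):

```cpp
#include <string>
#include <unordered_map>

// Insert two spec entries for the same file and report the level that
// actually takes effect under each policy.
//   overwrite == false: emplace -- duplicate key silently keeps the first
//                       value, the behavior the review flags as wrong.
//   overwrite == true:  insert_or_assign -- the later entry wins.
int effective_level(bool overwrite) {
    std::unordered_map<std::string, int> vlog_levels;
    vlog_levels.emplace("reduce.cc", 1);       // first spec entry
    if (overwrite) {
        vlog_levels.insert_or_assign("reduce.cc", 3);  // later entry wins
    } else {
        vlog_levels.emplace("reduce.cc", 3);   // silently ignored, stays 1
    }
    return vlog_levels.at("reduce.cc");
}
```

Whichever policy is chosen, the preceding `LOG` message should report the value that actually lands in the map.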
```cpp
// Handle ExtractElement: extract a scalar lane from a bfloat16x2 / float16x2
// vector (produced by packed reduction, etc.). The vector is stored as an
// opaque uint1 in the lowered code, but semantically it is a packed pair.
DataType vec_t = op->vectors.size() == 1 ? op->vectors[0].dtype() : DataType();
bool vec_is_bf16x2 = vec_t.is_bfloat16() && vec_t.lanes() == 2;
bool vec_is_fp16x2 = vec_t.is_float16() && vec_t.lanes() == 2;
if ((vec_is_bf16x2 || vec_is_fp16x2) && op->vectors.size() == 1 &&
    op->indices.size() == 1) {
  int lane = Downcast<IntImm>(op->indices[0])->value;
  std::string vec = PrintExpr(op->vectors[0]);
  if (vec_is_bf16x2) {
    enable_bf16_ = true;
    os << "bfloat16_t(((nv_bfloat162*)(&(" << vec << ")))->"
       << (lane == 0 ? "x" : "y") << ")";
  } else {
    enable_fp16_ = true;
    os << "half_t(((half2*)(&(" << vec << ")))->"
       << (lane == 0 ? "x" : "y") << ")";
  }
  return;
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Look at how reinterpret casting is typically done in this codegen
rg "reinterpret_cast" src/target/codegen_cuda.cc -B 2 -A 2 | head -40
```

Repository: tile-ai/tilelang
Length of output: 1877
Materialize the packed vector before reinterpret-casting.
The code at line 4378 passes PrintExpr(op->vectors[0]) directly into &(...). Since op->vectors[0] is an arbitrary Expr, PrintExpr may emit a function call or other expression that evaluates to a temporary. In C++, taking the address of a temporary is invalid. Materialize the expression using SSAGetID to ensure it's a valid lvalue (this pattern is already used at line 4245–4246 for the same packed vector types).
Also add a lane bounds check: for x2 vectors, valid lanes are 0 and 1 (this validation is already done at line 4248 for similar operations).
🛠️ Suggested fix

```diff
  int lane = Downcast<IntImm>(op->indices[0])->value;
+ ICHECK(lane == 0 || lane == 1)
+     << "ExtractElement on packed x2 expects lane 0 or 1, got " << lane;
- std::string vec = PrintExpr(op->vectors[0]);
+ std::string vec = SSAGetID(PrintExpr(op->vectors[0]), vec_t);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```cpp
// Handle ExtractElement: extract a scalar lane from a bfloat16x2 / float16x2
// vector (produced by packed reduction, etc.). The vector is stored as an
// opaque uint1 in the lowered code, but semantically it is a packed pair.
DataType vec_t = op->vectors.size() == 1 ? op->vectors[0].dtype() : DataType();
bool vec_is_bf16x2 = vec_t.is_bfloat16() && vec_t.lanes() == 2;
bool vec_is_fp16x2 = vec_t.is_float16() && vec_t.lanes() == 2;
if ((vec_is_bf16x2 || vec_is_fp16x2) && op->vectors.size() == 1 &&
    op->indices.size() == 1) {
  int lane = Downcast<IntImm>(op->indices[0])->value;
  std::string vec = PrintExpr(op->vectors[0]);
  if (vec_is_bf16x2) {
    enable_bf16_ = true;
    os << "bfloat16_t(((nv_bfloat162*)(&(" << vec << ")))->"
       << (lane == 0 ? "x" : "y") << ")";
  } else {
    enable_fp16_ = true;
    os << "half_t(((half2*)(&(" << vec << ")))->"
       << (lane == 0 ? "x" : "y") << ")";
  }
  return;
}
```

After:

```cpp
// Handle ExtractElement: extract a scalar lane from a bfloat16x2 / float16x2
// vector (produced by packed reduction, etc.). The vector is stored as an
// opaque uint1 in the lowered code, but semantically it is a packed pair.
DataType vec_t = op->vectors.size() == 1 ? op->vectors[0].dtype() : DataType();
bool vec_is_bf16x2 = vec_t.is_bfloat16() && vec_t.lanes() == 2;
bool vec_is_fp16x2 = vec_t.is_float16() && vec_t.lanes() == 2;
if ((vec_is_bf16x2 || vec_is_fp16x2) && op->vectors.size() == 1 &&
    op->indices.size() == 1) {
  int lane = Downcast<IntImm>(op->indices[0])->value;
  ICHECK(lane == 0 || lane == 1)
      << "ExtractElement on packed x2 expects lane 0 or 1, got " << lane;
  std::string vec = SSAGetID(PrintExpr(op->vectors[0]), vec_t);
  if (vec_is_bf16x2) {
    enable_bf16_ = true;
    os << "bfloat16_t(((nv_bfloat162*)(&(" << vec << ")))->"
       << (lane == 0 ? "x" : "y") << ")";
  } else {
    enable_fp16_ = true;
    os << "half_t(((half2*)(&(" << vec << ")))->"
       << (lane == 0 ? "x" : "y") << ")";
  }
  return;
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/target/codegen_cuda.cc` around lines 4369 - 4389, The ExtractElement
handling uses PrintExpr(op->vectors[0]) directly and takes its address, which
can produce an invalid address-to-temporary; change to materialize the vector
into an SSA variable (use SSAGetID or the existing pattern used for packed
vectors) and use that SSA name instead of PrintExpr(...) when forming the
reinterpret cast; also add a lane bounds check (ensure lane is 0 or 1) before
selecting "x"/"y" to mirror the validation used elsewhere; update both the
bfloat16 path (enable_bf16_) and float16 path (enable_fp16_) branches to use the
SSA variable and validated lane.
- Add tl::pack_half2 helper in common.h that returns uint1, avoiding taking the address of a temporary __pack_half2 return value
- Update ShuffleNode codegen for fp16x2 to use pack_half2 instead of &(__pack_half2(...))
- Re-enable fp16 batch pack (remove bf16-only restriction in can_batch_pack)
- Add SumOp_fp16x2/MaxOp_fp16x2/MinOp_fp16x2 packed reducer structs
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/target/codegen_cuda.cc (1)
4375-4384: ⚠️ Potential issue | 🟠 Major

Materialize the packed vector before taking its address, and reject out-of-range lanes.

`PrintExpr(op->vectors[0])` can expand to a temporary, so `&(...)` here is not a safe lvalue. This path also silently maps any non-zero lane to `.y` instead of enforcing 0/1.

🛠️ Proposed fix

```diff
  int lane = Downcast<IntImm>(op->indices[0])->value;
- std::string vec = PrintExpr(op->vectors[0]);
+ ICHECK(lane == 0 || lane == 1)
+     << "ExtractElement on packed x2 expects lane 0 or 1, got " << lane;
+ std::string vec = SSAGetID(PrintExpr(op->vectors[0]), vec_t);
  if (vec_is_bf16x2) {
    enable_bf16_ = true;
    os << "bfloat16_t(((nv_bfloat162*)(&(" << vec << ")))->"
```

🏁 Script executed:

```shell
#!/bin/bash
rg -n "VisitExpr_\\(const ShuffleNode \\*op" src/target/codegen_cuda.cc -A70 -B10
rg -n "SSAGetID\\(PrintExpr\\(" src/target/codegen_cuda.cc -C1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 4375 - 4384, The code is taking the address of PrintExpr(op->vectors[0]) which may be a temporary and therefore unsafe, and it also silently treats any non-zero lane as .y; fix by materializing op->vectors[0] into a local named temporary (use the same SSA/PrintExpr helper used elsewhere, e.g., generate a tmp with SSAGetID(PrintExpr(...)) or assign PrintExpr(...) to a local string temp variable) and then take the address of that local when forming the half2/nv_bfloat162 access; additionally validate the lane extracted from Downcast<IntImm>(op->indices[0])->value to only allow 0 or 1 and reject out-of-range lanes (fail fast or emit a diagnostic), and preserve the existing flags enable_bf16_ / enable_fp16_ in the bf16/float16 branches.

src/op/reduce.cc (1)
174-175: ⚠️ Potential issue | 🔴 Critical

Packed `abssum` still omits `abs2`.

This branch still treats `sum` and `abssum` the same, so packed `abssum` computes a plain sum instead of `sum(abs(x))`.

🐛 Suggested fix

```diff
- if (type->isSum() || type->isAbsSum()) {
+ if (type->isSum()) {
    return Call(acc.dtype(), tl::add2(), {acc, b});
+ } else if (type->isAbsSum()) {
+   return Call(acc.dtype(), tl::add2(),
+               {acc, Call(acc.dtype(), tl::abs2(), {b})});
  } else if (type->isMax()) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/reduce.cc` around lines 174 - 175, The branch in src/op/reduce.cc treats isSum() and isAbsSum() the same, causing packed abssum to compute a plain sum; update the branch so when type->isAbsSum() you wrap b with the abs2 operation before adding (e.g., replace the current return Call(acc.dtype(), tl::add2(), {acc, b}) with a return that uses Call(acc.dtype(), tl::add2(), {acc, Call(acc.dtype(), tl::abs2(), {b})}) when isAbsSum() so abssum computes sum(abs(x)).
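The semantic difference the finding protects is small but real: `sum` accumulates raw values, `abssum` accumulates magnitudes. A scalar model (`reduce_sum` / `reduce_abssum` are illustrative helpers, not the TL functors), showing why folding the two into one packed branch silently changes results for mixed-sign inputs:

```cpp
#include <cmath>
#include <vector>

// sum: accumulate raw values.
double reduce_sum(const std::vector<double>& xs) {
    double acc = 0.0;
    for (double x : xs) acc += x;
    return acc;
}

// abssum: accumulate |x| -- each element must pass through abs before
// the add, which is exactly what the packed path must do per lane
// (abs2 before add2) rather than reusing the plain-sum reducer.
double reduce_abssum(const std::vector<double>& xs) {
    double acc = 0.0;
    for (double x : xs) acc += std::fabs(x);
    return acc;
}
```

For all-positive data the two agree, which is why the bug can slip past tests that never exercise negative values.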
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/op/reduce.cc`:
- Around line 442-481: The packed-path is forcing reinterpretation to
cfg->vec_dtype (see GetReducePackConfig, cfg->vec_dtype and src_writer->dtype)
which breaks mixed-precision reductions; instead guard enabling packing so it
only runs when source element dtype matches the accumulator/clear buffer dtype
(clear_buffer->dtype) or perform an explicit widening conversion after the
packed BufferLoad (src_load) rather than retyping the load in-place; update the
conditional around can_pack/need_pack_buffer (and places that set
clear_buffer_packed) to check dtype equality between src_buffer/
src_var_compressed.back()->dtype and clear_buffer->dtype, and if dtypes differ
either skip packing or insert an explicit vectorized cast from cfg->vec_dtype to
the accumulator dtype after src_load rather than setting src_writer->dtype
directly.
In `@src/target/codegen_cuda.cc`:
- Around line 4362-4363: The generated code wraps tl::pack_half2 in an extra
uint1{...} causing nested uint1 initializers; update the emitter in
codegen_cuda.cc that currently does os << "uint1{tl::pack_half2(" << e0 << ", "
<< e1 << ")}"; to instead emit os << "tl::pack_half2(" << e0 << ", " << e1 <<
")"; so the output uses the single uint1 returned by tl::pack_half2 (also check
similar pack emitters to match the __pack_half2 style).
---
Duplicate comments:
In `@src/op/reduce.cc`:
- Around line 174-175: The branch in src/op/reduce.cc treats isSum() and
isAbsSum() the same, causing packed abssum to compute a plain sum; update the
branch so when type->isAbsSum() you wrap b with the abs2 operation before adding
(e.g., replace the current return Call(acc.dtype(), tl::add2(), {acc, b}) with a
return that uses Call(acc.dtype(), tl::add2(), {acc, Call(acc.dtype(),
tl::abs2(), {b})}) when isAbsSum() so abssum computes sum(abs(x)).
In `@src/target/codegen_cuda.cc`:
- Around line 4375-4384: The code is taking the address of
PrintExpr(op->vectors[0]) which may be a temporary and therefore unsafe, and it
also silently treats any non-zero lane as .y; fix by materializing
op->vectors[0] into a local named temporary (use the same SSA/PrintExpr helper
used elsewhere, e.g., generate a tmp with SSAGetID(PrintExpr(...)) or assign
PrintExpr(...) to a local string temp variable) and then take the address of
that local when forming the half2/nv_bfloat162 access; additionally validate the
lane extracted from Downcast<IntImm>(op->indices[0])->value to only allow 0 or 1
and reject/out-of-range lanes (fail fast or emit a diagnostic), and preserve the
existing flags enable_bf16_ / enable_fp16_ in the bf16/float16 branches.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 00418980-17e6-4611-b343-942ca14bf682
📒 Files selected for processing (4)
- src/op/reduce.cc
- src/target/codegen_cuda.cc
- src/tl_templates/cuda/common.h
- src/tl_templates/cuda/reduce.h
```cpp
os << "uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}";
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
rg -n "pack_half2|uint1\\{tl::pack_half2" src/target/codegen_cuda.cc src/tl_templates/cuda/common.h -C2
```

Repository: tile-ai/tilelang
Length of output: 2857
Remove the extra uint1{} around tl::pack_half2.
tl::pack_half2 at line 680 in src/tl_templates/cuda/common.h already returns uint1, so wrapping it again produces nested initialization uint1{uint1{...}} in the generated code. Other pack operations in the same file use __pack_half2 without extra wrapping.
Proposed fix

```diff
- os << "uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}";
+ os << "tl::pack_half2(" << e0 << ", " << e1 << ")";
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```cpp
os << "uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}";
}
```

After:

```cpp
os << "tl::pack_half2(" << e0 << ", " << e1 << ")";
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/target/codegen_cuda.cc` around lines 4362 - 4363, The generated code
wraps tl::pack_half2 in an extra uint1{...} causing nested uint1 initializers;
update the emitter in codegen_cuda.cc that currently does os <<
"uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}"; to instead emit os <<
"tl::pack_half2(" << e0 << ", " << e1 << ")"; so the output uses the single
uint1 returned by tl::pack_half2 (also check similar pack emitters to match the
__pack_half2 style).
- Replace ReducePackConfig with simple GetPreferedVectorizedSize(DataType, Target) -> int
- Merge getBatchReducerSuffix into MakeCodegenReducer(vsize)
- MakeReduce(vsize, acc, b) -> optional<PrimExpr>
- MakeInitValue(vsize) handles Broadcast internally
- Guard nan_propagate in MakeCodegenReducer vsize>1 path (no packed nan ops exist)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/op/reduce.h`:
- Around line 130-138: The header declares std::optional in MakeReduce and
MakeCodegenReducer but doesn’t include <optional>, relying on transitive
includes; add an explicit `#include` <optional> at the top of this header (before
the declarations of MakeInitValue, MakeReduce, and MakeCodegenReducer) so
MakeReduce and MakeCodegenReducer compile reliably without depending on other
headers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b0dd0fb7-522b-4a7e-92b4-cf83fa26e1e5
📒 Files selected for processing (2)
- src/op/reduce.cc
- src/op/reduce.h
🚧 Files skipped from review as they are similar to previous changes (1)
- src/op/reduce.cc
♻️ Duplicate comments (3)
src/op/reduce.cc (3)
624-629: ⚠️ Potential issue | 🔴 Critical

Batched packing still ignores `nan_propagate`.

Line 625 can enable packed batch AllReduce for fp16/bf16 max/min/absmax even though the packed local path already disables that case. From there, lines 628-629 can synthesize `tl::MaxOpNan_*x2` / `tl::MinOpNan_*x2`, but `src/tl_templates/cuda/reduce.h:91-131` only defines packed `SumOp_*x2`, `MaxOp_*x2`, and `MinOp_*x2`.

🐛 Proposed fix

```diff
- bool can_batch_pack = vsize > 1 && batch >= vsize && batch % vsize == 0;
+ bool can_batch_pack =
+     !nan_propagate && vsize > 1 && batch >= vsize &&
+     batch % vsize == 0;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/reduce.cc` around lines 624 - 629, The batched pack condition enables packed reducers even when nan_propagate should disable them; update the logic around GetPreferedVectorizedSize/GetPreferedVectorizedSize(...)->vsize and can_batch_pack so that can_batch_pack is false when nan_propagate is set for float16/bfloat16 and for max/min/absmax reducers (i.e., only allow vsize>1 when nan propagation is not required), then compute eff_batch and call MakeCodegenReducer with vsize only when can_batch_pack is true (otherwise pass 1) so MakeCodegenReducer does not synthesize tl::MaxOpNan_*x2 / tl::MinOpNan_*x2 for unsupported packed nan ops.
431-468: ⚠️ Potential issue | 🔴 Critical

Guard packed local lowering on matching source/accumulator dtypes.

The scalar path handles mixed precision by casting `rhs`, but lines 466-468 force the source load to `vec_dtype` in place. If `src_buffer->dtype != clear_buffer->dtype`, this becomes a raw reinterpretation instead of a widening conversion.

🐛 Proposed fix

```diff
- if (vsize > 1 && !src_var_compressed.empty() && !nan_propagate) {
+ if (vsize > 1 && !src_var_compressed.empty() && !nan_propagate &&
+     src_buffer->dtype == clear_buffer->dtype) {
    auto *ext = src_var_compressed.back()->dom->extent.as<IntImmNode>();
    if (ext && ext->value >= vsize && ext->value % vsize == 0) {
      can_pack = true;
      DataType vec_dtype = clear_buffer->dtype.with_lanes(vsize);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/reduce.cc` around lines 431 - 468, The code currently forces the vectorized load by mutating src_writer->dtype to vec_dtype which reinterprets bits when src_buffer->dtype != clear_buffer->dtype; instead, check if src_buffer->dtype == clear_buffer->dtype before changing dtype, and otherwise create an explicit widening Cast of the scalar load to vec_dtype (e.g., replace mutating src_writer->dtype with a new expr like Cast(vec_dtype, src_load) or produce a Cast of each lane) so mixed-precision is handled as a proper conversion; update the use of src_load/src_writer to use the Cast when types differ and only set src_writer->dtype when the underlying buffer dtypes already match.
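The reinterpret-vs-convert distinction behind this finding can be demonstrated with ordinary host types, using float→double as a stand-in for fp16→fp32 (`widen_by_cast` / `widen_by_reinterpret` are hypothetical names for the two behaviors):

```cpp
#include <cstdint>
#include <cstring>

// Value conversion: what a mixed-precision reduction needs -- the
// numeric value is preserved across the width change.
double widen_by_cast(float x) { return static_cast<double>(x); }

// Bit reinterpretation: what an in-place dtype retype on the load
// amounts to. The float's 32 bits are copied into a double-sized
// pattern unchanged, which is garbage as a double value.
double widen_by_reinterpret(float x) {
    uint64_t bits = 0;
    std::memcpy(&bits, &x, sizeof(x));  // copy raw float bits, no convert
    double out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

This is why the proposed fix either requires `src_buffer->dtype == clear_buffer->dtype` before packing or inserts an explicit vectorized cast after the packed load.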
172-173: ⚠️ Potential issue | 🔴 Critical

Packed `abssum` still drops `abs()`.

Line 172 folds `isAbsSum()` into the same packed reducer as `isSum()`, so any 2-lane packed `abssum` becomes a plain sum. The scalar path on lines 144-145 already shows the intended semantics.

🐛 Proposed fix

```diff
- if (type->isSum() || type->isAbsSum()) {
-   return Call(acc.dtype(), tl::add2(), {acc, b});
+ if (type->isSum()) {
+   return Call(acc.dtype(), tl::add2(), {acc, b});
+ } else if (type->isAbsSum()) {
+   return Call(acc.dtype(), tl::add2(),
+               {acc, Call(acc.dtype(), tl::abs2(), {b})});
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/reduce.cc` around lines 172 - 173, The packed reducer branch incorrectly treats type->isAbsSum() like isSum(), causing packed abssum to drop the abs; change the conditional so isAbsSum() is handled separately: when type->isSum() keep using Call(acc.dtype(), tl::add2(), {acc, b}), but when type->isAbsSum() compute the absolute of the incoming lane (e.g. Call(acc.dtype(), tl::abs(), {b}) or equivalent) and then Call(acc.dtype(), tl::add2(), {acc, abs_b}); update the branch that currently checks type->isSum() || type->isAbsSum() to distinguish the two cases using the unique symbols type->isAbsSum(), acc, b, tl::add2(), and tl::abs() so packed abssum preserves abs() semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/op/reduce.cc`:
- Around line 624-629: The batched pack condition enables packed reducers even
when nan_propagate should disable them; update the logic around
GetPreferedVectorizedSize/GetPreferedVectorizedSize(...)->vsize and
can_batch_pack so that can_batch_pack is false when nan_propagate is set for
float16/bfloat16 and for max/min/absmax reducers (i.e., only allow vsize>1 when
nan propagation is not required), then compute eff_batch and call
MakeCodegenReducer with vsize only when can_batch_pack is true (otherwise pass
1) so MakeCodegenReducer does not synthesize tl::MaxOpNan_*x2 / tl::MinOpNan_*x2
for unsupported packed nan ops.
- Around line 431-468: The code currently forces the vectorized load by mutating
src_writer->dtype to vec_dtype which reinterprets bits when src_buffer->dtype !=
clear_buffer->dtype; instead, check if src_buffer->dtype == clear_buffer->dtype
before changing dtype, and otherwise create an explicit widening Cast of the
scalar load to vec_dtype (e.g., replace mutating src_writer->dtype with a new
expr like Cast(vec_dtype, src_load) or produce a Cast of each lane) so
mixed-precision is handled as a proper conversion; update the use of
src_load/src_writer to use the Cast when types differ and only set
src_writer->dtype when the underlying buffer dtypes already match.
- Around line 172-173: The packed reducer branch incorrectly treats
type->isAbsSum() like isSum(), causing packed abssum to drop the abs; change the
conditional so isAbsSum() is handled separately: when type->isSum() keep using
Call(acc.dtype(), tl::add2(), {acc, b}), but when type->isAbsSum() compute the
absolute of the incoming lane (e.g. Call(acc.dtype(), tl::abs(), {b}) or
equivalent) and then Call(acc.dtype(), tl::add2(), {acc, abs_b}); update the
branch that currently checks type->isSum() || type->isAbsSum() to distinguish
the two cases using the unique symbols type->isAbsSum(), acc, b, tl::add2(), and
tl::abs() so packed abssum preserves abs() semantics.
Force-pushed b081ac5 to b1f3298 (Compare)
@regression-perf
Actionable comments posted: 3
♻️ Duplicate comments (3)
src/target/codegen_cuda.cc (2)
4371-4371: ⚠️ Potential issue | 🟠 Major
Remove the extra `uint1{}` around `tl::pack_half2`.
`tl::pack_half2` already returns `uint1`, so this emits a nested initializer (`uint1{uint1{...}}`) in generated CUDA.
Proposed fix

```diff
- os << "uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}";
+ os << "tl::pack_half2(" << e0 << ", " << e1 << ")";
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` at line 4371, The generated CUDA currently emits a nested initializer by wrapping tl::pack_half2(...) with an extra uint1{...}; remove that outer wrapper so the code emits tl::pack_half2(e0, e1) directly; update the site that writes to the output stream (the os << "uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}"; call in codegen_cuda.cc) to instead emit just tl::pack_half2 with the same e0 and e1 expressions.
4384-4393: ⚠️ Potential issue | 🟠 Major
Materialize the packed value before extracting a lane.
This path takes `&(PrintExpr(...))`, so any non-lvalue expression here generates an address-of-temporary in CUDA. It also needs the same 0/1 lane validation used by the other packed-lane extractors.
Proposed fix

```diff
  int lane = Downcast<IntImm>(op->indices[0])->value;
- std::string vec = PrintExpr(op->vectors[0]);
+ ICHECK(lane == 0 || lane == 1)
+     << "ExtractElement on packed x2 expects lane 0 or 1, got " << lane;
+ std::string vec = SSAGetID(PrintExpr(op->vectors[0]), vec_t);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/target/codegen_cuda.cc` around lines 4384 - 4393, The code is taking the address of PrintExpr(op->vectors[0]) which can produce an address-of-temporary; refactor the branch handling vec_is_bf16x2 / else to first materialize the packed vector expression into a temporary lvalue (same approach used by other packed-lane extractors) and then take the address of that temp when building the half2/nv_bfloat162 access; also add the same lane validation (ensure op->indices[0] is 0 or 1) used by the other packed-lane extractors before selecting "x" or "y", and ensure enable_bf16_/enable_fp16_ are still set accordingly.

src/op/reduce.cc (1)
422-459: ⚠️ Potential issue | 🔴 Critical
Guard packed lowering on matching source and accumulator dtypes.
This path still force-retypes `src_load` to `vec_dtype` derived from `clear_buffer->dtype`. For mixed-precision reductions that becomes a raw vector reinterpretation of the source buffer instead of a conversion.
Proposed fix

```diff
- if (vsize > 1 && !src_var_compressed.empty()) {
+ if (vsize > 1 && !src_var_compressed.empty() &&
+     src_buffer->dtype == clear_buffer->dtype) {
    auto *ext = src_var_compressed.back()->dom->extent.as<IntImmNode>();
    if (ext && ext->value >= vsize && ext->value % vsize == 0) {
      can_pack = true;
      DataType vec_dtype = clear_buffer->dtype.with_lanes(vsize);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/op/reduce.cc` around lines 422 - 459, The code currently forces reinterpretation of the source load to vec_dtype (derived from clear_buffer->dtype) which breaks mixed-precision reductions; restrict the packed-lowering path so it only enables when the source element dtype matches the accumulator/clear_buffer dtype. Concretely, augment the can_pack condition (around GetPreferedVectorizedSize(...) and src_var_compressed usage) with a dtype equality check (compare the source buffer element dtype — e.g., via src_buffer->dtype or src_var_compressed.back()->dtype — against clear_buffer->dtype) and only then create clear_buffer_packed, set need_pack_buffer, and assign src_writer->dtype = vec_dtype; if dtypes differ, skip the packed path (or perform an explicit conversion instead of retyping) so src_load is not naively reinterpreted.
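The difference between retyping and converting is easy to demonstrate outside of TVM. The NumPy sketch below (illustrative only; float32/int32 stand in for the bf16/fp16 pair, which NumPy cannot represent) contrasts a bitwise `view` with a value-preserving `astype`:

```python
import numpy as np

src = np.array([1.0, 2.0], dtype=np.float32)

# Reinterpretation: same bits, new type. This is what mutating
# src_writer->dtype amounts to when the buffer dtypes differ.
reinterpreted = src.view(np.int32)
print(reinterpreted.tolist())  # [1065353216, 1073741824] (raw bit patterns)

# Conversion: an explicit cast that preserves values, the analogue of
# wrapping the scalar load in Cast(vec_dtype, src_load).
converted = src.astype(np.int32)
print(converted.tolist())  # [1, 2]
```

Guarding the packed path on `src_buffer->dtype == clear_buffer->dtype` ensures only the second, value-preserving behavior can ever be emitted.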
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/op/reduce.cc`:
- Around line 132-158: The vsize==1 branch of ReduceOpNode::MakeReduce no longer
handles scalar bitwise reducers (bitand/bitor/bitxor), causing LOG(FATAL) when
scalar path calls MakeReduce(1); fix by adding cases for type->isBitAnd(),
type->isBitOr(), and type->isBitXor() in the vsize==1 block of
ReduceOpNode::MakeReduce: ensure rhs is cast to acc.dtype (as already done),
then return the appropriate bitwise expression (acc & rhs, acc | rhs, acc ^ rhs)
and only apply them for integer dtypes (or assert/handle otherwise), mirroring
how Sum/Max/Min are handled so scalar reductions succeed.
In `@testing/python/language/test_tilelang_language_reduce.py`:
- Around line 310-312: The section header string "nan_propagate tests – packed
(vsize=2) path for bf16/fp16" contains a Unicode en-dash (–) that triggers Ruff
RUF003; replace it with a normal ASCII hyphen (-) so the header reads
"nan_propagate tests - packed (vsize=2) path for bf16/fp16" to clear the lint
warning.
- Around line 319-329: The helper _make_nan_reduce_kernel should accept and
thread a batch parameter so the test can exercise the batched reducer path: add
a batch argument (default 1) to _make_nan_reduce_kernel, use T.Kernel(batch,
threads=threads) in the inner kernel, allocate dst and B with shape (M, batch)
(and dst fragment as (M, batch)), and pass batch=batch through to the reduce_fn
call (i.e., reduce_fn(src, dst, dim=1, nan_propagate=nan_propagate,
batch=batch)) so the batched reduction path is emitted.
---
Duplicate comments:
In `@src/op/reduce.cc`:
- Around line 422-459: The code currently forces reinterpretation of the source
load to vec_dtype (derived from clear_buffer->dtype) which breaks
mixed-precision reductions; restrict the packed-lowering path so it only enables
when the source element dtype matches the accumulator/clear_buffer dtype.
Concretely, augment the can_pack condition (around
GetPreferedVectorizedSize(...) and src_var_compressed usage) with a dtype
equality check (compare the source buffer element dtype — e.g., via
src_buffer->dtype or src_var_compressed.back()->dtype — against
clear_buffer->dtype) and only then create clear_buffer_packed, set
need_pack_buffer, and assign src_writer->dtype = vec_dtype; if dtypes differ,
skip the packed path (or perform an explicit conversion instead of retyping) so
src_load is not naively reinterpreted.
In `@src/target/codegen_cuda.cc`:
- Line 4371: The generated CUDA currently emits a nested initializer by wrapping
tl::pack_half2(...) with an extra uint1{...}; remove that outer wrapper so the
code emits tl::pack_half2(e0, e1) directly; update the site that writes to the
output stream (the os << "uint1{tl::pack_half2(" << e0 << ", " << e1 << ")}";
call in codegen_cuda.cc) to instead emit just tl::pack_half2 with the same e0
and e1 expressions.
- Around line 4384-4393: The code is taking the address of
PrintExpr(op->vectors[0]) which can produce an address-of-temporary; refactor
the branch handling vec_is_bf16x2 / else to first materialize the packed vector
expression into a temporary lvalue (same approach used by other packed-lane
extractors) and then take the address of that temp when building the
half2/nv_bfloat162 access; also add the same lane validation (ensure
op->indices[0] is 0 or 1) used by the other packed-lane extractors before
selecting "x" or "y", and ensure enable_bf16_/enable_fp16_ are still set
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4db743dd-b757-4804-85ab-b89071be61c1
📒 Files selected for processing (7)
- src/op/builtin.cc
- src/op/builtin.h
- src/op/reduce.cc
- src/target/codegen_cuda.cc
- src/tl_templates/cuda/common.h
- src/tl_templates/cuda/reduce.h
- testing/python/language/test_tilelang_language_reduce.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/tl_templates/cuda/reduce.h
```cpp
std::optional<PrimExpr> ReduceOpNode::MakeReduce(int vsize, const PrimExpr &acc,
                                                 const PrimExpr &b) const {
  if (vsize == 1) {
    PrimExpr rhs = b;
    if (acc->dtype != rhs->dtype) {
      rhs = Cast(acc->dtype, rhs);
    }
    const bool use_nan_op = nan_propagate && (acc.dtype().is_float16() ||
                                              acc.dtype().is_bfloat16());
    if (type->isSum()) {
      return acc + rhs;
    } else if (type->isAbsSum()) {
      return acc + Max(rhs, -rhs);
    } else if (type->isMax()) {
      return use_nan_op ? Call(acc.dtype(), tl::max_nan(), {acc, rhs})
                        : PrimExpr(Max(acc, rhs));
    } else if (type->isMin()) {
      return use_nan_op ? Call(acc.dtype(), tl::min_nan(), {acc, rhs})
                        : PrimExpr(Min(acc, rhs));
    } else if (type->isAbsMax()) {
      auto abs_rhs = Max(rhs, -rhs);
      return use_nan_op ? Call(acc.dtype(), tl::max_nan(), {acc, abs_rhs})
                        : PrimExpr(Max(acc, abs_rhs));
    }
    LOG(FATAL) << "Unsupported reduce type: " << type->type;
    return std::nullopt;
```
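The `use_nan_op` flag exists because ordinary max/min comparisons silently drop NaN depending on operand order. A minimal host-side illustration (the `max_nan` helper here is a hypothetical sketch of the semantics `tl::max_nan()` is meant to provide, not the real device code):

```python
import math

# Python's max() is order-dependent in the presence of NaN, mirroring the
# IEEE comparison behavior that plain hardware max instructions follow.
print(max(1.0, float("nan")))              # 1.0 (the NaN is lost)
print(math.isnan(max(float("nan"), 1.0)))  # True

def max_nan(a, b):
    # NaN-propagating max: NaN in either operand poisons the result,
    # regardless of operand order.
    if math.isnan(a) or math.isnan(b):
        return float("nan")
    return max(a, b)

print(math.isnan(max_nan(1.0, float("nan"))))  # True
print(math.isnan(max_nan(float("nan"), 1.0)))  # True
```

This is also why the packed path must stay disabled when `nan_propagate` is set: the packed `max2`/`min2` reducers do not carry these NaN-propagating semantics.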
Restore scalar bitwise reducers in MakeReduce.
The new vsize == 1 branch no longer handles bitand/bitor/bitxor, but the scalar fragment path below still calls MakeReduce(1).value(). Those reductions now hit LOG(FATAL) during lowering.
Proposed fix

```diff
  if (type->isSum()) {
    return acc + rhs;
  } else if (type->isAbsSum()) {
    return acc + Max(rhs, -rhs);
+ } else if (type->isBitAnd()) {
+   return bitwise_and(acc, rhs);
+ } else if (type->isBitOr()) {
+   return bitwise_or(acc, rhs);
+ } else if (type->isBitXor()) {
+   return bitwise_xor(acc, rhs);
  } else if (type->isMax()) {
    return use_nan_op ? Call(acc.dtype(), tl::max_nan(), {acc, rhs})
                      : PrimExpr(Max(acc, rhs));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/op/reduce.cc` around lines 132 - 158, The vsize==1 branch of
ReduceOpNode::MakeReduce no longer handles scalar bitwise reducers
(bitand/bitor/bitxor), causing LOG(FATAL) when scalar path calls MakeReduce(1);
fix by adding cases for type->isBitAnd(), type->isBitOr(), and type->isBitXor()
in the vsize==1 block of ReduceOpNode::MakeReduce: ensure rhs is cast to
acc.dtype (as already done), then return the appropriate bitwise expression (acc
& rhs, acc | rhs, acc ^ rhs) and only apply them for integer dtypes (or
assert/handle otherwise), mirroring how Sum/Max/Min are handled so scalar
reductions succeed.
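Restoring these branches matters because scalar bitwise reductions are well defined only with the right identity element: all-ones for AND, zero for OR and XOR. A quick host-side check of the expected semantics:

```python
from functools import reduce
import operator

vals = [0b1100, 0b1010, 0b0110]

# The three scalar bitwise reducers MakeReduce(1) must support,
# each seeded with its identity element.
and_red = reduce(operator.and_, vals, 0xFFFF)  # identity: all ones
or_red = reduce(operator.or_, vals, 0)         # identity: zero
xor_red = reduce(operator.xor, vals, 0)        # identity: zero

print(bin(and_red), bin(or_red), bin(xor_red))  # 0b0 0b1110 0b0
```

Without the restored branches, any fragment reduction requesting one of these operators aborts in `LOG(FATAL)` instead of producing the values above.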
```python
# ---------------------------------------------------------------------------
# nan_propagate tests – packed (vsize=2) path for bf16/fp16
# ---------------------------------------------------------------------------
```
Replace the Unicode dash in the section header.
Ruff flags the – here (RUF003). Swapping it to - will clear the lint warning.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 311-311: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@testing/python/language/test_tilelang_language_reduce.py` around lines 310 -
312, The section header string "nan_propagate tests – packed (vsize=2) path for
bf16/fp16" contains a Unicode en-dash (–) that triggers Ruff RUF003; replace it
with a normal ASCII hyphen (-) so the header reads "nan_propagate tests - packed
(vsize=2) path for bf16/fp16" to clear the lint warning.
```python
def _make_nan_reduce_kernel(reduce_fn, M, N, dtype, threads, *, nan_propagate):

    @T.prim_func
    def kernel(A: T.Tensor((M, N), dtype), B: T.Tensor((M,), dtype)):
        with T.Kernel(1, threads=threads):
            src = T.alloc_fragment((M, N), dtype)
            dst = T.alloc_fragment((M,), dtype)
            T.copy(A, src)
            reduce_fn(src, dst, dim=1, nan_propagate=nan_propagate)
            T.copy(dst, B)

    return kernel
```
Thread batch through the nan-reduce test helper.
The helper always emits the scalar reduction call, so test_reduce_packed_max_nan_batch_runtime below never hits run_batch or the packed batched reducer path. A batched NaN regression would still pass this suite.
Proposed fix
```diff
-def _make_nan_reduce_kernel(reduce_fn, M, N, dtype, threads, *, nan_propagate):
+def _make_nan_reduce_kernel(
+    reduce_fn, M, N, dtype, threads, *, nan_propagate, batch=1
+):
     @T.prim_func
     def kernel(A: T.Tensor((M, N), dtype), B: T.Tensor((M,), dtype)):
         with T.Kernel(1, threads=threads):
             src = T.alloc_fragment((M, N), dtype)
             dst = T.alloc_fragment((M,), dtype)
             T.copy(A, src)
-            reduce_fn(src, dst, dim=1, nan_propagate=nan_propagate)
+            kwargs = {"nan_propagate": nan_propagate}
+            if batch != 1:
+                kwargs["batch"] = batch
+            reduce_fn(src, dst, dim=1, **kwargs)
             T.copy(dst, B)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@testing/python/language/test_tilelang_language_reduce.py` around lines 319 -
329, The helper _make_nan_reduce_kernel should accept and thread a batch
parameter so the test can exercise the batched reducer path: add a batch
argument (default 1) to _make_nan_reduce_kernel, use T.Kernel(batch,
threads=threads) in the inner kernel, allocate dst and B with shape (M, batch)
(and dst fragment as (M, batch)), and pass batch=batch through to the reduce_fn
call (i.e., reduce_fn(src, dst, dim=1, nan_propagate=nan_propagate,
batch=batch)) so the batched reduction path is emitted.
Performance Regression Test Report
Triggered by: @kurisu6912
Results / Artifacts
…re/packed-reduce-bf16-add2

# Conflicts:
#   src/op/reduce.cc
#   src/op/reduce.h
#   src/runtime/logging.cc
Summary
Auto-vectorize bf16/fp16 fragment reduce to emit packed `add2`/`max2`/`min2` instead of scalar `add.bf16`. Enabled when the reduction extent is a compile-time constant divisible by the pack factor (2 for bf16/fp16), and `nan_propagate` is not set.
Generated code (bf16 reduce_sum, 4×256, dim=1)
single-warp (32 threads)
multi-warp (128 threads)
batched AllReduce (batch=4, threads=256)
Local reduce is the same as above. After the horizontal reduce produces scalar values, pairs are packed and AllReduce uses `SumOp_bf16x2`.
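The overall data flow (lanewise packed accumulate, then a horizontal combine of the two lane partials) can be sanity-checked on the host. The NumPy sketch below only emulates the reduction structure, not the generated CUDA, and accumulates in float32 to sidestep fp16 rounding:

```python
import numpy as np

x = np.arange(256, dtype=np.float16)  # extent divisible by the pack factor 2

# Lanewise accumulate: each packed "register" holds two fp16 lanes, so the
# even-indexed elements fold into lane 0 and the odd-indexed into lane 1.
lanes = x.reshape(-1, 2).astype(np.float32).sum(axis=0)

# Horizontal reduce: combine the two lane partials into the scalar result.
packed_total = lanes.sum()

print(packed_total)  # 32640.0, identical to the straight scalar sum
assert packed_total == x.astype(np.float32).sum()
```

Splitting the reduction this way is what lets the inner loop issue one packed `add2` per pair instead of two scalar adds.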
SumOp_bf16x2:Summary by CodeRabbit
Release Notes
New Features
Tests